Featurization & Model Tuning (FMT) Project - 5th Project submitted for PGP-AIML Great Learning on 06-Feb-2022
• DOMAIN: Semiconductor manufacturing process
• CONTEXT: A complex modern semiconductor manufacturing process is normally under constant surveillance via the monitoring of
signals/variables collected from sensors and/or process measurement points. However, not all of these signals are equally valuable in a
specific monitoring system: the measured signals contain a combination of useful information, irrelevant information, and noise.
Engineers typically have many more signals than are actually required. If we consider each type of signal as a feature, then
feature selection may be applied to identify the most relevant signals. Process engineers may then use these signals to determine key
factors contributing to yield excursions downstream in the process. This will enable increased process throughput, decreased time to
learning, and reduced per-unit production costs. These signals can be used as features to predict the yield type, and by analysing
different combinations of features, the essential signals that impact the yield type can be identified.
• DATA Description:
sensor-data.csv : (1567, 592)
The data consists of 1567 datapoints each with 591 features.
The dataset presented in this case represents a selection of such features, where each example represents a single production entity with
its associated measured features, and the labels represent a simple pass/fail yield from in-house line testing. In the target column, “-1” corresponds
to a pass and “1” corresponds to a fail; the timestamp is for that specific test point.
PROJECT OBJECTIVE:
We will build a classifier to predict the Pass/Fail yield of a particular process entity and analyse whether all the features are required to build the model or not.
1. Import and understand the data. [5 Marks]
### Importing Required Libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
from scipy.stats import zscore, norm, randint
from sklearn.cluster import KMeans
# Ignore warnings
import warnings
warnings.filterwarnings("ignore")
A. Import ‘signal-data.csv’ as a DataFrame. [2 Marks]
signal_data = pd.read_csv("signal-data.csv") # import the csv file
signal_data.head()
| Time | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | ... | 581 | 582 | 583 | 584 | 585 | 586 | 587 | 588 | 589 | Pass/Fail | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2008-07-19 11:55:00 | 3030.93 | 2564.00 | 2187.7333 | 1411.1265 | 1.3602 | 100.0 | 97.6133 | 0.1242 | 1.5005 | ... | NaN | 0.5005 | 0.0118 | 0.0035 | 2.3630 | NaN | NaN | NaN | NaN | -1 |
| 1 | 2008-07-19 12:32:00 | 3095.78 | 2465.14 | 2230.4222 | 1463.6606 | 0.8294 | 100.0 | 102.3433 | 0.1247 | 1.4966 | ... | 208.2045 | 0.5019 | 0.0223 | 0.0055 | 4.4447 | 0.0096 | 0.0201 | 0.0060 | 208.2045 | -1 |
| 2 | 2008-07-19 13:17:00 | 2932.61 | 2559.94 | 2186.4111 | 1698.0172 | 1.5102 | 100.0 | 95.4878 | 0.1241 | 1.4436 | ... | 82.8602 | 0.4958 | 0.0157 | 0.0039 | 3.1745 | 0.0584 | 0.0484 | 0.0148 | 82.8602 | 1 |
| 3 | 2008-07-19 14:43:00 | 2988.72 | 2479.90 | 2199.0333 | 909.7926 | 1.3204 | 100.0 | 104.2367 | 0.1217 | 1.4882 | ... | 73.8432 | 0.4990 | 0.0103 | 0.0025 | 2.0544 | 0.0202 | 0.0149 | 0.0044 | 73.8432 | -1 |
| 4 | 2008-07-19 15:22:00 | 3032.24 | 2502.87 | 2233.3667 | 1326.5200 | 1.5334 | 100.0 | 100.3967 | 0.1235 | 1.5031 | ... | NaN | 0.4800 | 0.4766 | 0.1045 | 99.3032 | 0.0202 | 0.0149 | 0.0044 | 73.8432 | -1 |
5 rows × 592 columns
signal_data.shape
(1567, 592)
There are 1567 samples and 592 columns (features/attributes) in the signal data.
signal_data.dtypes.value_counts()
float64    590
object       1
int64        1
dtype: int64
Of the 592 columns, 590 are float, one is integer (the Pass/Fail target), and one is object (the Time column).
B. Print the 5-point summary and share at least 2 observations. [3 Marks]
signal_data.describe().T
| count | mean | std | min | 25% | 50% | 75% | max | |
|---|---|---|---|---|---|---|---|---|
| 0 | 1561.0 | 3014.452896 | 73.621787 | 2743.2400 | 2966.260000 | 3011.4900 | 3056.6500 | 3356.3500 |
| 1 | 1560.0 | 2495.850231 | 80.407705 | 2158.7500 | 2452.247500 | 2499.4050 | 2538.8225 | 2846.4400 |
| 2 | 1553.0 | 2200.547318 | 29.513152 | 2060.6600 | 2181.044400 | 2201.0667 | 2218.0555 | 2315.2667 |
| 3 | 1553.0 | 1396.376627 | 441.691640 | 0.0000 | 1081.875800 | 1285.2144 | 1591.2235 | 3715.0417 |
| 4 | 1553.0 | 4.197013 | 56.355540 | 0.6815 | 1.017700 | 1.3168 | 1.5257 | 1114.5366 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 586 | 1566.0 | 0.021458 | 0.012358 | -0.0169 | 0.013425 | 0.0205 | 0.0276 | 0.1028 |
| 587 | 1566.0 | 0.016475 | 0.008808 | 0.0032 | 0.010600 | 0.0148 | 0.0203 | 0.0799 |
| 588 | 1566.0 | 0.005283 | 0.002867 | 0.0010 | 0.003300 | 0.0046 | 0.0064 | 0.0286 |
| 589 | 1566.0 | 99.670066 | 93.891919 | 0.0000 | 44.368600 | 71.9005 | 114.7497 | 737.3048 |
| Pass/Fail | 1567.0 | -0.867262 | 0.498010 | -1.0000 | -1.000000 | -1.0000 | -1.0000 | 1.0000 |
591 rows × 8 columns
Observations: (1) Many features have counts below 1567 (e.g. feature 0 has 1561), confirming the presence of missing values. (2) The Pass/Fail mean of -0.867 with a median of -1 shows the classes are heavily imbalanced toward pass. (3) Feature scales vary widely (e.g. feature 0 in the thousands vs. feature 588 below 0.03), so scaling will be needed for distance-based models.
2. Data cleansing: [15 Marks]
A. Write a for loop which will remove all the features with 20%+ Null values and impute rest with mean of the feature. [5 Marks]
signal_data.shape
(1567, 592)
##copying data for doing the imputation
sdata=signal_data.copy()
sdata.shape
(1567, 592)
def remove_null_columns(data, percentageLimit):
    percent_missing = round(data.isnull().sum() * 100 / len(data), 2)  # percentage of missing values per column
    missing_value_data = pd.DataFrame({'Feature Name': data.columns,
                                       'percent_missing': percent_missing})
    featurelist_more_null = missing_value_data[percent_missing > percentageLimit]
    cols_to_drop = featurelist_more_null['Feature Name']
    print(len(cols_to_drop))
    data.drop(labels=cols_to_drop, axis=1, inplace=True)
    return data
sdata=remove_null_columns(sdata,20)
32
sdata.shape
(1567, 560)
Comments: 32 features that have more than 20% null values are removed.
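For reference, the same drop-then-impute step can be written more compactly with pandas built-ins (a sketch; the helper name and the `thresh` arithmetic are illustrative, not part of the project code, and assume the same 20% cutoff):

```python
import pandas as pd

def drop_sparse_and_impute(df, max_null_frac=0.20):
    """Drop columns whose null fraction exceeds max_null_frac,
    then mean-impute the remaining numeric columns."""
    keep_thresh = int(len(df) * (1 - max_null_frac))  # minimum non-null rows a column must have
    out = df.dropna(axis=1, thresh=keep_thresh).copy()
    num_cols = out.select_dtypes(include="number").columns
    out[num_cols] = out[num_cols].fillna(out[num_cols].mean())
    return out
```

This keeps non-numeric columns such as Time untouched, which matches the loop/SimpleImputer combination above.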
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(strategy='mean')
imputer = imputer.fit(sdata.iloc[:,1:-1])
## Mean-impute all columns except the first (Time) and the last (Pass/Fail)
sdata.iloc[:,1:-1] = imputer.transform(sdata.iloc[:,1:-1])
sdata.head()
| Time | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | ... | 577 | 582 | 583 | 584 | 585 | 586 | 587 | 588 | 589 | Pass/Fail | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2008-07-19 11:55:00 | 3030.93 | 2564.00 | 2187.7333 | 1411.1265 | 1.3602 | 100.0 | 97.6133 | 0.1242 | 1.5005 | ... | 14.9509 | 0.5005 | 0.0118 | 0.0035 | 2.3630 | 0.021458 | 0.016475 | 0.005283 | 99.670066 | -1 |
| 1 | 2008-07-19 12:32:00 | 3095.78 | 2465.14 | 2230.4222 | 1463.6606 | 0.8294 | 100.0 | 102.3433 | 0.1247 | 1.4966 | ... | 10.9003 | 0.5019 | 0.0223 | 0.0055 | 4.4447 | 0.009600 | 0.020100 | 0.006000 | 208.204500 | -1 |
| 2 | 2008-07-19 13:17:00 | 2932.61 | 2559.94 | 2186.4111 | 1698.0172 | 1.5102 | 100.0 | 95.4878 | 0.1241 | 1.4436 | ... | 9.2721 | 0.4958 | 0.0157 | 0.0039 | 3.1745 | 0.058400 | 0.048400 | 0.014800 | 82.860200 | 1 |
| 3 | 2008-07-19 14:43:00 | 2988.72 | 2479.90 | 2199.0333 | 909.7926 | 1.3204 | 100.0 | 104.2367 | 0.1217 | 1.4882 | ... | 8.5831 | 0.4990 | 0.0103 | 0.0025 | 2.0544 | 0.020200 | 0.014900 | 0.004400 | 73.843200 | -1 |
| 4 | 2008-07-19 15:22:00 | 3032.24 | 2502.87 | 2233.3667 | 1326.5200 | 1.5334 | 100.0 | 100.3967 | 0.1235 | 1.5031 | ... | 10.9698 | 0.4800 | 0.4766 | 0.1045 | 99.3032 | 0.020200 | 0.014900 | 0.004400 | 73.843200 | -1 |
5 rows × 560 columns
sdata.isnull().sum().sum()
0
B. Identify and drop the features which have the same value for all rows. [3 Marks]
cols= sdata.columns
cols
Index(['Time', '0', '1', '2', '3', '4', '5', '6', '7', '8',
...
'577', '582', '583', '584', '585', '586', '587', '588', '589',
'Pass/Fail'],
dtype='object', length=560)
samevalue_columns = []
for col in cols:
    if sdata[col].nunique() == 1:  # a single unique value means the column is constant across all rows
        samevalue_columns.append(col)
print('number of columns', len(samevalue_columns))
number of columns 116
print(samevalue_columns)
['5', '13', '42', '49', '52', '69', '97', '141', '149', '178', '179', '186', '189', '190', '191', '192', '193', '194', '226', '229', '230', '231', '232', '233', '234', '235', '236', '237', '240', '241', '242', '243', '256', '257', '258', '259', '260', '261', '262', '263', '264', '265', '266', '276', '284', '313', '314', '315', '322', '325', '326', '327', '328', '329', '330', '364', '369', '370', '371', '372', '373', '374', '375', '378', '379', '380', '381', '394', '395', '396', '397', '398', '399', '400', '401', '402', '403', '404', '414', '422', '449', '450', '451', '458', '461', '462', '463', '464', '465', '466', '481', '498', '501', '502', '503', '504', '505', '506', '507', '508', '509', '512', '513', '514', '515', '528', '529', '530', '531', '532', '533', '534', '535', '536', '537', '538']
sdata.drop(labels=samevalue_columns, axis=1,inplace=True)
sdata.shape
(1567, 444)
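The constant-column filter above can also be expressed without an explicit loop (a sketch; note that `nunique` ignores NaNs by default, which matches data that has already been imputed):

```python
import pandas as pd

def drop_constant_columns(df):
    """Keep only columns that have more than one distinct value."""
    return df.loc[:, df.nunique() > 1]
```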
#sdata.to_csv('sdata_cleared_features.csv')
sdata.describe().T
| count | mean | std | min | 25% | 50% | 75% | max | |
|---|---|---|---|---|---|---|---|---|
| 0 | 1567.0 | 3014.452896 | 73.480613 | 2743.2400 | 2966.66500 | 3011.8400 | 3056.5400 | 3356.3500 |
| 1 | 1567.0 | 2495.850231 | 80.227793 | 2158.7500 | 2452.88500 | 2498.9100 | 2538.7450 | 2846.4400 |
| 2 | 1567.0 | 2200.547318 | 29.380932 | 2060.6600 | 2181.09995 | 2200.9556 | 2218.0555 | 2315.2667 |
| 3 | 1567.0 | 1396.376627 | 439.712852 | 0.0000 | 1083.88580 | 1287.3538 | 1590.1699 | 3715.0417 |
| 4 | 1567.0 | 4.197013 | 56.103066 | 0.6815 | 1.01770 | 1.3171 | 1.5296 | 1114.5366 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 586 | 1567.0 | 0.021458 | 0.012354 | -0.0169 | 0.01345 | 0.0205 | 0.0276 | 0.1028 |
| 587 | 1567.0 | 0.016475 | 0.008805 | 0.0032 | 0.01060 | 0.0148 | 0.0203 | 0.0799 |
| 588 | 1567.0 | 0.005283 | 0.002866 | 0.0010 | 0.00330 | 0.0046 | 0.0064 | 0.0286 |
| 589 | 1567.0 | 99.670066 | 93.861936 | 0.0000 | 44.36860 | 72.0230 | 114.7497 | 737.3048 |
| Pass/Fail | 1567.0 | -0.867262 | 0.498010 | -1.0000 | -1.00000 | -1.0000 | -1.0000 | 1.0000 |
443 rows × 8 columns
C. Drop other features if required, using relevant functional knowledge. Clearly justify the same. [2 Marks]
zerocolumns = []
for col in sdata.columns:
    count = (sdata[col] == 0).sum()
    if count > 1500:
        print('column', col, 'number of rows that has zero', count)
        zerocolumns.append(col)
column 74 number of rows that has zero 1560
column 114 number of rows that has zero 1545
column 206 number of rows that has zero 1560
column 209 number of rows that has zero 1560
column 249 number of rows that has zero 1545
column 342 number of rows that has zero 1560
column 347 number of rows that has zero 1560
column 387 number of rows that has zero 1545
column 478 number of rows that has zero 1560
column 521 number of rows that has zero 1546
print('initial columns', len(zerocolumns))
categorical_column = []
for col in zerocolumns:
    if sdata[col].nunique() == 2:  # a binary column could be a categorical flag worth keeping
        categorical_column.append(col)
        print(col)
print('categorical columns for validation', len(categorical_column))
initial columns 10
categorical columns for validation 0
## These near-constant columns (zero in over 95% of rows, none of them binary flags) carry almost no information, so drop them
sdata.drop(labels=zerocolumns, axis=1, inplace=True)
sdata.shape
(1567, 434)
D. Check for multicollinearity in the data and take necessary action. [3 Marks]
Analysis shows that many features exhibit multicollinearity. We therefore first remove highly correlated columns (|r| > 0.7)
and then apply VIF to remove the remaining highly multicollinear columns.
correlated_features = set()
correlation_matrix =sdata.corr()
print(correlation_matrix )
0 1 2 3 4 6 \
0 1.000000 -0.143840 0.004756 -0.007613 -0.011014 0.002270
1 -0.143840 1.000000 0.005767 -0.007568 -0.001636 -0.025564
2 0.004756 0.005767 1.000000 0.298935 0.095891 -0.136225
3 -0.007613 -0.007568 0.298935 1.000000 -0.058483 -0.685835
4 -0.011014 -0.001636 0.095891 -0.058483 1.000000 -0.074368
... ... ... ... ... ... ...
586 0.018443 -0.009403 -0.025495 0.034711 -0.043929 -0.041209
587 -0.025880 0.017266 -0.029345 -0.039132 -0.031005 0.034027
588 -0.028166 0.010118 -0.030818 -0.033645 -0.026100 0.032227
589 0.004174 0.044797 -0.032890 -0.080341 0.050910 0.043777
Pass/Fail -0.025141 -0.002603 -0.000957 -0.024623 -0.013756 0.016239
7 8 9 10 ... 577 582 \
0 0.031483 -0.052622 0.009045 0.006504 ... 0.008601 0.000224
1 -0.012037 0.031258 0.023964 0.009645 ... -0.010145 0.043556
2 -0.146213 0.023528 0.016168 0.069893 ... -0.028705 -0.006023
3 0.073856 -0.102892 0.068215 0.049873 ... 0.016438 0.008988
4 -0.347734 -0.025946 0.054206 -0.006470 ... -0.004070 0.045081
... ... ... ... ... ... ... ...
586 0.058113 0.010433 0.033738 0.000327 ... -0.002684 -0.016726
587 -0.021426 0.022845 0.059301 0.046965 ... -0.009405 -0.024473
588 -0.020893 0.026250 0.060758 0.046048 ... -0.015596 -0.020705
589 -0.107804 -0.022770 0.004880 0.008393 ... -0.024766 0.041486
Pass/Fail 0.012991 0.028016 -0.031191 0.033639 ... -0.049633 0.047020
583 584 585 586 587 588 \
0 0.023453 0.019907 0.023589 0.018443 -0.025880 -0.028166
1 0.002904 -0.001264 0.002273 -0.009403 0.017266 0.010118
2 0.015697 0.018225 0.015752 -0.025495 -0.029345 -0.030818
3 0.025436 0.024736 0.026019 0.034711 -0.039132 -0.033645
4 -0.001300 -0.001597 -0.001616 -0.043929 -0.031005 -0.026100
... ... ... ... ... ... ...
586 0.002257 0.001605 0.002743 1.000000 0.167913 0.164238
587 -0.002649 -0.002498 -0.002930 0.167913 1.000000 0.974276
588 -0.002260 -0.001957 -0.002530 0.164238 0.974276 1.000000
589 -0.003008 -0.003295 -0.003800 -0.486559 0.390813 0.389211
Pass/Fail 0.005981 0.005419 0.005034 0.004156 0.035391 0.031167
589 Pass/Fail
0 0.004174 -0.025141
1 0.044797 -0.002603
2 -0.032890 -0.000957
3 -0.080341 -0.024623
4 0.050910 -0.013756
... ... ...
586 -0.486559 0.004156
587 0.390813 0.035391
588 0.389211 0.031167
589 1.000000 -0.002653
Pass/Fail -0.002653 1.000000
[433 rows x 433 columns]
##method to find the correlated columns
for i in range(len(correlation_matrix.columns)):
    for j in range(i):
        if abs(correlation_matrix.iloc[i, j]) > 0.7:
            colname = correlation_matrix.columns[i]
            correlated_features.add(colname)
len(correlated_features)
233
sdata.shape
(1567, 434)
print(correlated_features)
{'287', '567', '36', '431', '452', '341', '361', '360', '298', '350', '470', '205', '339', '524', '541', '336', '296', '354', '527', '199', '312', '123', '551', '174', '279', '351', '335', '349', '425', '46', '413', '391', '125', '576', '333', '334', '271', '545', '320', '440', '127', '362', '490', '540', '165', '66', '140', '416', '471', '494', '254', '96', '337', '283', '405', '447', '430', '27', '427', '367', '454', '288', '302', '355', '130', '437', '523', '332', '187', '22', '252', '568', '294', '60', '386', '477', '496', '522', '584', '148', '453', '426', '343', '164', '198', '429', '467', '569', '539', '299', '457', '409', '204', '50', '278', '285', '474', '557', '406', '357', '554', '469', '368', '428', '552', '304', '338', '290', '388', '497', '553', '560', '154', '280', '417', '441', '197', '390', '392', '407', '421', '525', '319', '439', '445', '308', '324', '303', '556', '555', '104', '412', '272', '435', '456', '434', '207', '70', '203', '340', '424', '26', '493', '573', '105', '196', '39', '353', '17', '393', '51', '356', '224', '273', '317', '526', '309', '321', '411', '297', '585', '35', '34', '442', '291', '289', '163', '376', '30', '282', '436', '459', '491', '147', '359', '479', '306', '348', '480', '588', '377', '577', '566', '475', '202', '275', '286', '281', '316', '473', '365', '301', '65', '185', '101', '331', '575', '277', '415', '444', '318', '300', '366', '420', '520', '305', '549', '152', '311', '455', '344', '155', '106', '574', '561', '443', '310', '363', '98', '270', '54', '352', '389', '446', '448', '274', '307', '408', '410', '323', '295', '124', '495'}
sdata.drop(labels=correlated_features, axis=1, inplace=True)
sdata.shape
(1567, 201)
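The pairwise scan above can also be vectorised with an upper-triangle mask, which checks each feature pair exactly once (a sketch with an illustrative helper name, assuming the same 0.7 cutoff and numeric columns only):

```python
import numpy as np
import pandas as pd

def correlated_columns(df, threshold=0.7):
    """Columns whose absolute correlation with an earlier column
    exceeds the threshold."""
    corr = df.corr().abs()
    # keep only the strict upper triangle so each pair is considered once
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    return [c for c in upper.columns if (upper[c] > threshold).any()]
```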
sdata.dtypes.value_counts()
float64    199
object       1
int64        1
dtype: int64
#### Data preparation for applying VIF
import pandas as pd
import numpy as np
from statsmodels.stats.outliers_influence import variance_inflation_factor
labels=['Pass/Fail','Time']
X = sdata.drop(labels= labels , axis = 1)
y = sdata["Pass/Fail"]
print('x shape',X.shape[1])
#vif = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
#print(vif)
x shape 199
labels=['Pass/Fail','Time']
#newdata = sdata.drop(labels= labels , axis = 1)
newdata = X.copy()
X.shape
(1567, 199)
columnsToDrop=[]
def getVIF(data, VIFThreshold):
    vif = pd.DataFrame()
    vif["VIF Factor"] = [variance_inflation_factor(data.values, i) for i in range(data.shape[1])]
    vif["features"] = data.columns
    df_mask = vif['VIF Factor'] >= VIFThreshold
    filtered_df = vif[df_mask]
    print('shape of columns that have more than given Threshold', filtered_df.shape)
    remove_columns = filtered_df['features']
    return remove_columns
remove_columns=getVIF(X,10)
shape of columns that have more than given Threshold (118, 2)
columnsToDrop.append(remove_columns)
len(columnsToDrop[0])
118
sdata.head()
| Time | 0 | 1 | 2 | 3 | 4 | 6 | 7 | 8 | 9 | ... | 565 | 570 | 571 | 572 | 582 | 583 | 586 | 587 | 589 | Pass/Fail | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2008-07-19 11:55:00 | 3030.93 | 2564.00 | 2187.7333 | 1411.1265 | 1.3602 | 97.6133 | 0.1242 | 1.5005 | 0.0162 | ... | 0.14561 | 533.8500 | 2.1113 | 8.95 | 0.5005 | 0.0118 | 0.021458 | 0.016475 | 99.670066 | -1 |
| 1 | 2008-07-19 12:32:00 | 3095.78 | 2465.14 | 2230.4222 | 1463.6606 | 0.8294 | 102.3433 | 0.1247 | 1.4966 | -0.0005 | ... | 0.14561 | 535.0164 | 2.4335 | 5.92 | 0.5019 | 0.0223 | 0.009600 | 0.020100 | 208.204500 | -1 |
| 2 | 2008-07-19 13:17:00 | 2932.61 | 2559.94 | 2186.4111 | 1698.0172 | 1.5102 | 95.4878 | 0.1241 | 1.4436 | 0.0041 | ... | 0.62190 | 535.0245 | 2.0293 | 11.21 | 0.4958 | 0.0157 | 0.058400 | 0.048400 | 82.860200 | 1 |
| 3 | 2008-07-19 14:43:00 | 2988.72 | 2479.90 | 2199.0333 | 909.7926 | 1.3204 | 104.2367 | 0.1217 | 1.4882 | -0.0124 | ... | 0.16300 | 530.5682 | 2.0253 | 9.33 | 0.4990 | 0.0103 | 0.020200 | 0.014900 | 73.843200 | -1 |
| 4 | 2008-07-19 15:22:00 | 3032.24 | 2502.87 | 2233.3667 | 1326.5200 | 1.5334 | 100.3967 | 0.1235 | 1.5031 | -0.0031 | ... | 0.14561 | 532.0155 | 2.0275 | 8.83 | 0.4800 | 0.4766 | 0.020200 | 0.014900 | 73.843200 | -1 |
5 rows × 201 columns
len(remove_columns)
118
sdata.drop(labels= remove_columns , axis = 1, inplace=True)
sdata.shape
(1567, 83)
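Dropping all 118 high-VIF columns in one shot is aggressive, because removing one collinear column changes the VIFs of the rest. A common refinement (a sketch, not the approach used above) drops the single worst offender at a time and recomputes; here the VIF and the R² regression are computed with plain numpy so the sketch does not depend on statsmodels, and the helper names are illustrative:

```python
import numpy as np
import pandas as pd

def vif(df, col):
    """VIF of one column: 1 / (1 - R^2) from regressing it on the
    remaining columns (with an intercept)."""
    y = df[col].to_numpy(dtype=float)
    X = df.drop(columns=[col]).to_numpy(dtype=float)
    X = np.column_stack([np.ones(len(X)), X])  # add intercept
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    r2 = 1 - resid.var() / y.var()
    return 1.0 / (1.0 - r2) if r2 < 1 else np.inf

def drop_high_vif(df, threshold=10.0):
    """Iteratively drop the worst-VIF column until all columns pass."""
    cols = list(df.columns)
    while len(cols) > 1:
        vifs = {c: vif(df[cols], c) for c in cols}
        worst, worst_vif = max(vifs.items(), key=lambda kv: kv[1])
        if worst_vif < threshold:
            break
        cols.remove(worst)
    return df[cols]
```

The iterative version typically keeps more columns than a single batch drop at the same threshold.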
fig, ax = plt.subplots(figsize=(20,20))
#sns.heatmap(df2.corr(), center=0, cmap='BrBG', annot=True)
# Generate a mask for the upper triangle
mask = np.triu(np.ones_like(sdata.corr(), dtype=bool))
sns.heatmap(sdata.corr(), center=0, cmap='mako', annot=True, fmt='.2f', linewidths=0.05, mask=mask)
ax.set_title('Heat map of semiconductor data after removing highly collinear columns')
Text(0.5, 1.0, 'Heat map of semiconductor data after removing highly collinear columns')
correlated_features = set()
correlation_matrix =sdata.corr()
print(correlation_matrix )
4 9 10 24 41 59 \
4 1.000000 0.054206 -0.006470 -0.013636 -0.012742 -0.020577
9 0.054206 1.000000 -0.064065 0.014420 -0.042435 -0.026476
10 -0.006470 -0.064065 1.000000 -0.014916 -0.025927 0.085646
24 -0.013636 0.014420 -0.014916 1.000000 -0.005596 0.036681
41 -0.012742 -0.042435 -0.025927 -0.005596 1.000000 -0.007074
... ... ... ... ... ... ...
583 -0.001300 -0.036036 0.039060 -0.010550 -0.000840 -0.027673
586 -0.043929 0.033738 0.000327 0.016466 -0.025716 -0.042800
587 -0.031005 0.059301 0.046965 0.003232 -0.002800 -0.014624
589 0.050910 0.004880 0.008393 -0.016735 0.013800 0.042628
Pass/Fail -0.013756 -0.031191 0.033639 -0.018297 0.002480 0.155771
75 76 77 78 ... 510 511 \
4 0.023082 0.042888 -0.000736 0.051320 ... -0.002970 -0.005096
9 0.049423 0.236932 -0.020110 0.054193 ... -0.032539 -0.032834
10 -0.028658 0.072108 0.010721 -0.031953 ... 0.007579 -0.011040
24 0.053163 -0.025389 0.022202 -0.095334 ... -0.016509 0.008941
41 0.024982 -0.102415 -0.012645 -0.081467 ... 0.027138 0.064621
... ... ... ... ... ... ... ...
583 -0.039981 0.023517 0.037906 0.002076 ... 0.026293 -0.026523
586 -0.004506 -0.032273 0.034370 0.116209 ... -0.040976 -0.058987
587 -0.048725 0.055792 -0.022418 -0.068960 ... 0.025221 0.013058
589 0.008004 0.056562 -0.026583 -0.082986 ... 0.026100 0.049598
Pass/Fail 0.027941 -0.055674 0.005413 -0.043483 ... 0.131587 0.054925
559 565 572 583 586 587 \
4 -0.027123 0.040886 -0.012024 -0.001300 -0.043929 -0.031005
9 0.024042 -0.035400 0.044216 -0.036036 0.033738 0.059301
10 0.001498 0.018570 0.047202 0.039060 0.000327 0.046965
24 -0.033409 -0.021824 -0.020520 -0.010550 0.016466 0.003232
41 -0.026723 0.025003 -0.002044 -0.000840 -0.025716 -0.002800
... ... ... ... ... ... ...
583 0.026368 0.042122 -0.017368 1.000000 0.002257 -0.002649
586 -0.004989 0.028386 -0.008668 0.002257 1.000000 0.167913
587 -0.011375 0.009609 -0.001425 -0.002649 0.167913 1.000000
589 0.010519 0.009927 -0.022672 -0.003008 -0.486559 0.390813
Pass/Fail 0.024099 0.040309 -0.032233 0.005981 0.004156 0.035391
589 Pass/Fail
4 0.050910 -0.013756
9 0.004880 -0.031191
10 0.008393 0.033639
24 -0.016735 -0.018297
41 0.013800 0.002480
... ... ...
583 -0.003008 0.005981
586 -0.486559 0.004156
587 0.390813 0.035391
589 1.000000 -0.002653
Pass/Fail -0.002653 1.000000
[82 rows x 82 columns]
for i in range(len(correlation_matrix.columns)):
    for j in range(i):
        if abs(correlation_matrix.iloc[i, j]) > 0.6:
            colname = correlation_matrix.columns[i]
            correlated_features.add(colname)
len(correlated_features)
0
sdata.shape
(1567, 83)
sdata.head(5)
| Time | 4 | 9 | 10 | 24 | 41 | 59 | 75 | 76 | 77 | ... | 510 | 511 | 559 | 565 | 572 | 583 | 586 | 587 | 589 | Pass/Fail | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2008-07-19 11:55:00 | 1.3602 | 0.0162 | -0.0034 | 751.00 | 4.515 | -1.7264 | 0.0126 | -0.0206 | 0.0141 | ... | 64.6707 | 0.0000 | 0.4385 | 0.14561 | 8.95 | 0.0118 | 0.021458 | 0.016475 | 99.670066 | -1 |
| 1 | 2008-07-19 12:32:00 | 0.8294 | -0.0005 | -0.0148 | -1640.25 | 2.773 | 0.8073 | -0.0039 | -0.0198 | 0.0004 | ... | 141.4365 | 0.0000 | 0.1745 | 0.14561 | 5.92 | 0.0223 | 0.009600 | 0.020100 | 208.204500 | -1 |
| 2 | 2008-07-19 13:17:00 | 1.5102 | 0.0041 | 0.0013 | -1916.50 | 5.434 | 23.8245 | -0.0078 | -0.0326 | -0.0052 | ... | 240.7767 | 244.2748 | 0.3718 | 0.62190 | 11.21 | 0.0157 | 0.058400 | 0.048400 | 82.860200 | 1 |
| 3 | 2008-07-19 14:43:00 | 1.3204 | -0.0124 | -0.0033 | -1657.25 | 1.279 | 24.3791 | -0.0555 | -0.0461 | -0.0400 | ... | 113.5593 | 0.0000 | 0.7288 | 0.16300 | 9.33 | 0.0103 | 0.020200 | 0.014900 | 73.843200 | -1 |
| 4 | 2008-07-19 15:22:00 | 1.5334 | -0.0031 | -0.0072 | 117.00 | 2.209 | -12.2945 | -0.0534 | 0.0183 | -0.0167 | ... | 148.0663 | 0.0000 | 0.2156 | 0.14561 | 8.83 | 0.4766 | 0.020200 | 0.014900 | 73.843200 | -1 |
5 rows × 83 columns
Comments: after correlation-based and VIF-based pruning, no remaining feature pair has |correlation| > 0.6, so multicollinearity has been addressed.
signal_df=sdata.copy()
E. Make all relevant modifications on the data using both functional/logical reasoning/assumptions. [2 Marks]
signal_df.shape
(1567, 82)
The Time column is only a timestamp/identifier with no predictive value for the yield, hence dropping this column.
signal_df.drop(labels='Time',axis=1,inplace=True)
signal_df.shape
(1567, 82)
## Method to find the Low Variance columns
import numpy as np
import pandas as pd
from sklearn.feature_selection import VarianceThreshold
# Just make a convenience function; this one wraps the VarianceThreshold
# transformer but you can pass it a pandas dataframe and get one in return
def get_low_variance_columns(dframe=None, columns=None,
                             skip_columns=None, thresh=0.0,
                             autoremove=False):
    """
    Wrapper for sklearn VarianceThreshold for use on pandas dataframes.
    """
    print("Finding low-variance features.")
    removed_features = []
    try:
        # get list of all the original df columns
        all_columns = dframe.columns
        print(len(all_columns))
        # remove `skip_columns`
        remaining_columns = all_columns.drop(skip_columns)
        # get length of new index
        max_index = len(remaining_columns) - 1
        # get indices for `skip_columns`
        skipped_idx = [all_columns.get_loc(column)
                       for column in skip_columns]
        # adjust insert location by the number of columns removed
        # (for non-zero insertion locations) to keep relative
        # locations intact
        for idx, item in enumerate(skipped_idx):
            if item > max_index:
                diff = item - max_index
                skipped_idx[idx] -= diff
            if item == max_index:
                diff = item - len(skip_columns)
                skipped_idx[idx] -= diff
            if idx == 0:
                skipped_idx[idx] = item
        # get values of `skip_columns`
        skipped_values = dframe.iloc[:, skipped_idx].values
        # get dataframe values
        X = dframe.loc[:, remaining_columns].values
        # instantiate VarianceThreshold object
        vt = VarianceThreshold(threshold=thresh)
        # fit vt to data
        vt.fit(X)
        # get the indices of the features that are being kept
        feature_indices = vt.get_support(indices=True)
        # remove low-variance columns from index
        feature_names = [remaining_columns[idx]
                         for idx, _ in enumerate(remaining_columns)
                         if idx in feature_indices]
        # get the columns to be removed
        removed_features = list(np.setdiff1d(remaining_columns, feature_names))
        print("Found {0} low-variance columns.".format(len(removed_features)))
        # remove the columns
        if autoremove and len(removed_features) > 0:  # 'and' (not '&') avoids a precedence bug; remove even a single column
            print("Removing low-variance features.")
            # remove the low-variance columns
            X_removed = vt.transform(X)
            print("Reassembling the dataframe (with low-variance "
                  "features removed).")
            # re-assemble the dataframe
            dframe = pd.DataFrame(data=X_removed, columns=feature_names)
            # add back the `skip_columns`
            for idx, index in enumerate(skipped_idx):
                dframe.insert(loc=index,
                              column=skip_columns[idx],
                              value=skipped_values[:, idx])
            print("Successfully removed low-variance columns.")
        # do not remove columns
        else:
            print("No changes have been made to the dataframe.")
    except Exception as e:
        print(e)
        print("Could not remove low-variance features. Something "
              "went wrong.")
    return dframe, removed_features
signal_df.columns
Index(['4', '9', '10', '24', '41', '59', '75', '76', '77', '78', '79', '80',
'81', '82', '91', '92', '93', '94', '95', '99', '100', '102', '107',
'108', '129', '135', '139', '144', '151', '153', '156', '159', '160',
'161', '162', '171', '177', '184', '195', '201', '210', '211', '212',
'213', '214', '217', '219', '221', '222', '223', '227', '228', '248',
'251', '253', '418', '419', '432', '433', '438', '468', '476', '482',
'483', '484', '485', '486', '487', '488', '489', '499', '500', '510',
'511', '559', '565', '572', '583', '586', '587', '589', 'Pass/Fail'],
dtype='object')
skip_columns=['Pass/Fail']
dframe, removed_features=get_low_variance_columns(signal_df,signal_df.columns,skip_columns,0.0,True)
Finding low-variance features.
82
Found 0 low-variance columns.
No changes have been made to the dataframe.
len(removed_features)
0
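For reference, the same zero-variance check can be done directly in pandas without the wrapper above (a sketch; like VarianceThreshold, it flags columns whose variance does not exceed the threshold):

```python
import pandas as pd

def low_variance_columns(df, thresh=0.0):
    """Numeric columns whose variance is at or below thresh."""
    variances = df.select_dtypes(include="number").var()
    return list(variances[variances <= thresh].index)
```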
so_numeric_df =signal_df.select_dtypes(include=[int,float])
features=so_numeric_df.columns
print(features)
print(len(features))
Index(['4', '9', '10', '24', '41', '59', '75', '76', '77', '78', '79', '80',
'81', '82', '91', '92', '93', '94', '95', '99', '100', '102', '107',
'108', '129', '135', '139', '144', '151', '153', '156', '159', '160',
'161', '162', '171', '177', '184', '195', '201', '210', '211', '212',
'213', '214', '217', '219', '221', '222', '223', '227', '228', '248',
'251', '253', '418', '419', '432', '433', '438', '468', '476', '482',
'483', '484', '485', '486', '487', '488', '489', '499', '500', '510',
'511', '559', '565', '572', '583', '586', '587', '589'],
dtype='object')
81
signal_df.head()
| 4 | 9 | 10 | 24 | 41 | 59 | 75 | 76 | 77 | 78 | ... | 510 | 511 | 559 | 565 | 572 | 583 | 586 | 587 | 589 | Pass/Fail | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.3602 | 0.0162 | -0.0034 | 751.00 | 4.515 | -1.7264 | 0.0126 | -0.0206 | 0.0141 | -0.0307 | ... | 64.6707 | 0.0000 | 0.4385 | 0.14561 | 8.95 | 0.0118 | 0.021458 | 0.016475 | 99.670066 | -1 |
| 1 | 0.8294 | -0.0005 | -0.0148 | -1640.25 | 2.773 | 0.8073 | -0.0039 | -0.0198 | 0.0004 | -0.0440 | ... | 141.4365 | 0.0000 | 0.1745 | 0.14561 | 5.92 | 0.0223 | 0.009600 | 0.020100 | 208.204500 | -1 |
| 2 | 1.5102 | 0.0041 | 0.0013 | -1916.50 | 5.434 | 23.8245 | -0.0078 | -0.0326 | -0.0052 | 0.0213 | ... | 240.7767 | 244.2748 | 0.3718 | 0.62190 | 11.21 | 0.0157 | 0.058400 | 0.048400 | 82.860200 | 1 |
| 3 | 1.3204 | -0.0124 | -0.0033 | -1657.25 | 1.279 | 24.3791 | -0.0555 | -0.0461 | -0.0400 | 0.0400 | ... | 113.5593 | 0.0000 | 0.7288 | 0.16300 | 9.33 | 0.0103 | 0.020200 | 0.014900 | 73.843200 | -1 |
| 4 | 1.5334 | -0.0031 | -0.0072 | 117.00 | 2.209 | -12.2945 | -0.0534 | 0.0183 | -0.0167 | -0.0449 | ... | 148.0663 | 0.0000 | 0.2156 | 0.14561 | 8.83 | 0.4766 | 0.020200 | 0.014900 | 73.843200 | -1 |
5 rows × 82 columns
signal_df['Pass/Fail'].replace({-1:0},inplace=True) # recode pass from -1 to 0 so the target becomes 0/1
signal_df.head()
| 4 | 9 | 10 | 24 | 41 | 59 | 75 | 76 | 77 | 78 | ... | 510 | 511 | 559 | 565 | 572 | 583 | 586 | 587 | 589 | Pass/Fail | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.3602 | 0.0162 | -0.0034 | 751.00 | 4.515 | -1.726400 | 0.0126 | -0.0206 | 0.0141 | -0.0307 | ... | 64.670700 | 0.0000 | 0.4385 | 0.145610 | 8.95 | 0.01180 | 0.021458 | 0.016475 | 99.670066 | 0 |
| 1 | 0.8294 | -0.0005 | -0.0148 | -1640.25 | 2.773 | 0.807300 | -0.0039 | -0.0198 | 0.0004 | -0.0440 | ... | 107.584525 | 0.0000 | 0.1745 | 0.145610 | 5.92 | 0.02230 | 0.009600 | 0.020100 | 208.204500 | 0 |
| 2 | 1.5102 | 0.0041 | 0.0013 | -1916.50 | 4.739 | 13.627425 | -0.0078 | -0.0326 | -0.0052 | 0.0213 | ... | 107.584525 | 244.2748 | 0.3718 | 0.285575 | 11.21 | 0.01570 | 0.048825 | 0.034850 | 82.860200 | 1 |
| 3 | 1.3204 | -0.0124 | -0.0033 | -1657.25 | 1.475 | 13.627425 | -0.0555 | -0.0461 | -0.0400 | 0.0400 | ... | 107.584525 | 0.0000 | 0.7288 | 0.163000 | 9.33 | 0.01030 | 0.020200 | 0.014900 | 73.843200 | 0 |
| 4 | 1.5334 | -0.0031 | -0.0072 | 117.00 | 2.209 | -11.145175 | -0.0534 | 0.0183 | -0.0167 | -0.0449 | ... | 107.584525 | 0.0000 | 0.2156 | 0.145610 | 8.83 | 0.02385 | 0.020200 | 0.014900 | 73.843200 | 0 |
5 rows × 82 columns
signal_df_copy=signal_df.copy()
signal_df_before=signal_df.copy()
signal_df_im=signal_df.copy()
def IQR_capping(data, cols, factor):
    for col in cols:
        q1 = data[col].quantile(0.25)
        q3 = data[col].quantile(0.75)
        IQR = q3 - q1
        upper_whisker = q3 + (factor * IQR)
        lower_whisker = q1 - (factor * IQR)
        # if a value is greater than the upper whisker, cap it at the upper whisker;
        # if it is less than the lower whisker, cap it at the lower whisker;
        # otherwise retain the value
        data[col] = np.where(data[col] > upper_whisker, upper_whisker,
                             np.where(data[col] < lower_whisker, lower_whisker, data[col]))
    return data
signal_df_im=IQR_capping(signal_df,features,1.5)
#Boxplot to check for outliers
plt.figure(figsize=(50, 50))
col = 1
for i in signal_df_before.columns:
    plt.subplot(11, 8, col)
    sns.boxplot(signal_df_before[i])
    col += 1
plt.show()
## After capping outliers
#Boxplot to verify the outliers were capped
plt.figure(figsize=(50, 50))
col = 1
for i in signal_df_im.columns:
    plt.subplot(11, 8, col)
    sns.boxplot(signal_df_im[i], color='green')
    col += 1
plt.show()
Comments:
After IQR capping, the outliers in all chosen columns of the signal dataset are clipped to the whiskers rather than removed. Note that IQR_capping modifies its argument in place, so signal_df itself is also capped.
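The same capping can be written more compactly with `Series.clip` (a sketch with an illustrative name; equivalent to the loop above for factor 1.5, but it works on a copy instead of mutating its input):

```python
import pandas as pd

def iqr_cap(df, cols, factor=1.5):
    """Cap values outside [Q1 - factor*IQR, Q3 + factor*IQR]."""
    out = df.copy()
    for col in cols:
        q1, q3 = out[col].quantile([0.25, 0.75])
        iqr = q3 - q1
        out[col] = out[col].clip(q1 - factor * iqr, q3 + factor * iqr)
    return out
```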
signal_df_im['Pass/Fail'].head()
0    0
1    0
2    1
3    0
4    0
Name: Pass/Fail, dtype: int64
signal_df_im.groupby(["Pass/Fail"]).count()
| 4 | 9 | 10 | 24 | 41 | 59 | 75 | 76 | 77 | 78 | ... | 500 | 510 | 511 | 559 | 565 | 572 | 583 | 586 | 587 | 589 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Pass/Fail | |||||||||||||||||||||
| 0 | 1463 | 1463 | 1463 | 1463 | 1463 | 1463 | 1463 | 1463 | 1463 | 1463 | ... | 1463 | 1463 | 1463 | 1463 | 1463 | 1463 | 1463 | 1463 | 1463 | 1463 |
| 1 | 104 | 104 | 104 | 104 | 104 | 104 | 104 | 104 | 104 | 104 | ... | 104 | 104 | 104 | 104 | 104 | 104 | 104 | 104 | 104 | 104 |
2 rows × 81 columns
3. Data analysis & visualisation: [5 Marks]
A. Perform a detailed univariate Analysis with appropriate detailed comments after each analysis. [2 Marks]
Univariate analysis can be performed in several ways; the most common ones are applied below.
signal_df_im.shape
(1567, 82)
signal_df_im.isnull().sum().sum()
0
signal_df_im.dtypes.value_counts()
float64    81
int64       1
dtype: int64
signal_df_im.columns
Index(['4', '9', '10', '24', '41', '59', '75', '76', '77', '78', '79', '80',
'81', '82', '91', '92', '93', '94', '95', '99', '100', '102', '107',
'108', '129', '135', '139', '144', '151', '153', '156', '159', '160',
'161', '162', '171', '177', '184', '195', '201', '210', '211', '212',
'213', '214', '217', '219', '221', '222', '223', '227', '228', '248',
'251', '253', '418', '419', '432', '433', '438', '468', '476', '482',
'483', '484', '485', '486', '487', '488', '489', '499', '500', '510',
'511', '559', '565', '572', '583', '586', '587', '589', 'Pass/Fail'],
dtype='object')
def plotDistPlot(data, col):
    sns.set_theme()
    plt.rcParams['figure.figsize'] = (7, 5)
    # Note: distplot is deprecated in seaborn >= 0.11; histplot(..., kde=True) is its replacement
    sns.distplot(data[col])
    plt.title('Distribution Plot for Feature ' + col, fontsize=20)
    plt.xlabel('Measurement of feature ' + col)
    plt.show()
plotDistPlot(signal_df_im,'4')
plotDistPlot(signal_df_im,'201')
plotDistPlot(signal_df_im,'214')
plotDistPlot(signal_df_im,'484')
plotDistPlot(signal_df_im,'Pass/Fail')
plotDistPlot(signal_df_im,'418')
plotDistPlot(signal_df_im,'572')
Comments
Most features are approximately normally distributed; a few show bimodal or right-skewed distributions.
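The visual impression of skewness can also be quantified numerically. A small sketch (a hypothetical `skew_report` helper, demonstrated on synthetic data rather than `signal_df_im`):

```python
import pandas as pd

def skew_report(df, threshold=1.0):
    # Flag features whose sample skewness exceeds the threshold;
    # values near 0 indicate symmetry, large positive values right skew.
    sk = df.skew(numeric_only=True)
    return sk[sk.abs() > threshold].sort_values(ascending=False)

# Synthetic example: 'sym' is symmetric, 'right' is heavily right-skewed
demo = pd.DataFrame({'sym': [1, 2, 3, 4, 5],
                     'right': [1, 1, 1, 2, 50]})
print(skew_report(demo).index.tolist())  # ['right']
```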
signal_df_im['Pass/Fail'].value_counts()
0    1463
1     104
Name: Pass/Fail, dtype: int64
import seaborn as sns
sns.countplot(x='Pass/Fail', data=signal_df_im)
plt.title('Number of Pass Fail')
plt.ylabel('count')
plt.xlabel('Pass_fail')
plt.xticks(rotation = 90)
plt.show();
plt.title("Pie chart for Pass/Fail ")
colors = sns.color_palette('pastel')[0:5]
signal_df_im['Pass/Fail'].value_counts().plot.pie(autopct='%1.2f%%',shadow=True, colors=colors)
plt.show()
Comments:
The target is highly imbalanced: only about 7% of the records (104 of 1567) belong to the "Fail" class (1), while the remaining 93% are "Pass" (0).
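With a roughly 93/7 split, plain accuracy would be misleading. One common mitigation (an illustration only, not part of this notebook's pipeline) is to weight classes inversely to their frequency, e.g. with scikit-learn's `class_weight='balanced'`:

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight
from sklearn.linear_model import LogisticRegression

# Class counts taken from value_counts() above: 1463 Pass (0), 104 Fail (1)
y = np.array([0] * 1463 + [1] * 104)

# 'balanced' assigns each class the weight n_samples / (n_classes * n_class_samples),
# so the minority Fail class is up-weighted
weights = compute_class_weight(class_weight='balanced', classes=np.array([0, 1]), y=y)
print(dict(zip([0, 1], weights)))

# The same reweighting inside an estimator:
clf = LogisticRegression(class_weight='balanced', max_iter=1000)
```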
import matplotlib.pyplot as plt
signal_df_im1 = signal_df_im.select_dtypes(exclude=['object'])
for column in signal_df_im1:
    plt.figure(figsize=(17, 1))
    sns.boxplot(data=signal_df_im1, x=column)
df_1 = signal_df_im.iloc[:,0:20]
df_1['Pass/Fail']=signal_df_im['Pass/Fail']
df_2 = signal_df_im.iloc[:,20:40]
df_2['Pass/Fail']=signal_df_im['Pass/Fail']
df_3 = signal_df_im.iloc[:,40:60]
df_3['Pass/Fail']=signal_df_im['Pass/Fail']
df_4 = signal_df_im.iloc[:,60:82]
df_4.head()
| 468 | 476 | 482 | 483 | 484 | 485 | 486 | 487 | 488 | 489 | ... | 510 | 511 | 559 | 565 | 572 | 583 | 586 | 587 | 589 | Pass/Fail | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 311.6377 | 31.989300 | 613.3069 | 291.4842 | 494.6996 | 178.1759 | 843.1138 | 0.0000 | 53.1098 | 0.0000 | ... | 64.670700 | 0.0000 | 0.4385 | 0.145610 | 8.95 | 0.01180 | 0.021458 | 0.016475 | 99.670066 | 0 |
| 1 | 463.2883 | 30.864300 | 0.0000 | 246.7762 | 0.0000 | 359.0444 | 130.6350 | 820.7900 | 194.4371 | 0.0000 | ... | 107.584525 | 0.0000 | 0.1745 | 0.145610 | 5.92 | 0.02230 | 0.009600 | 0.020100 | 208.204500 | 0 |
| 2 | 21.3645 | 13.392300 | 434.2674 | 151.7665 | 0.0000 | 190.3869 | 746.9150 | 74.0741 | 191.7582 | 250.1742 | ... | 107.584525 | 244.2748 | 0.3718 | 0.285575 | 11.21 | 0.01570 | 0.048825 | 0.034850 | 82.860200 | 1 |
| 3 | 24.2831 | 35.432300 | 225.0169 | 100.4883 | 305.7500 | 88.5553 | 104.6660 | 71.7583 | 0.0000 | 336.7660 | ... | 107.584525 | 0.0000 | 0.7288 | 0.163000 | 9.33 | 0.01030 | 0.020200 | 0.014900 | 73.843200 | 0 |
| 4 | 44.8980 | 41.827975 | 171.4486 | 276.8810 | 461.8619 | 240.1781 | 0.0000 | 587.3773 | 748.1781 | 0.0000 | ... | 107.584525 | 0.0000 | 0.2156 | 0.145610 | 8.83 | 0.02385 | 0.020200 | 0.014900 | 73.843200 | 0 |
5 rows × 22 columns
B. Perform bivariate and multivariate analysis with appropriate detailed comments after each analysis. [3 Marks]
# Boxplots split by the target, to compare the two classes per feature
plt.figure(figsize=(50, 50))
col = 1
for i in signal_df_im.columns:
    plt.subplot(12, 8, col)
    sns.boxplot(y=i, x='Pass/Fail', data=signal_df_im)
    col += 1
plt.show()
Observations:
* To analyse the data against the target, bivariate analysis is performed with box plots per feature, split by Pass/Fail.<br>
* The first (blue) box corresponds to Pass (0) and the second (orange) to Fail (1).<br>
* In general, the Pass class shows a higher IQR than the Fail class for most features.<br>
* The median is almost the same for both classes for most features.<br>
* The value ranges of the two classes largely overlap.<br>
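These visual impressions can be cross-checked numerically by grouping on the target. A sketch of per-class median and IQR (a hypothetical `per_class_stats` helper, demonstrated on a synthetic stand-in for `signal_df_im`):

```python
import pandas as pd

def per_class_stats(df, target):
    # Median and IQR of every numeric feature, split by the target class,
    # to cross-check the visual impressions from the boxplots
    grouped = df.groupby(target)
    median = grouped.median()
    iqr = grouped.quantile(0.75) - grouped.quantile(0.25)
    return median, iqr

# Synthetic stand-in with one feature and the same target name
demo = pd.DataFrame({'f1': [1, 2, 3, 10, 20, 30],
                     'Pass/Fail': [0, 0, 0, 1, 1, 1]})
median, iqr = per_class_stats(demo, 'Pass/Fail')
print(median['f1'].tolist(), iqr['f1'].tolist())  # [2.0, 20.0] [1.0, 10.0]
```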
# Box Plots
f, (ax) = plt.subplots(1, 1, figsize=(12, 4))
f.suptitle('Pass/Fail based on signal 4', fontsize=14)
sns.boxplot(x="Pass/Fail", y="4", data=signal_df_im, ax=ax)
ax.set_xlabel("Pass/Fail",size = 12,alpha=0.8)
ax.set_ylabel("Signal 4 %",size = 12,alpha=0.8)
Text(0, 0.5, 'Signal 4 %')
Observation
The median is similar for Pass and Fail for feature 4, and the distribution for the Fail class is left-skewed.
# Box Plots
f, (ax) = plt.subplots(1, 1, figsize=(12, 4))
f.suptitle('Pass/Fail based on signal 9', fontsize=14)
sns.boxplot(x="Pass/Fail", y="9", data=signal_df_im, ax=ax)
ax.set_xlabel("Pass/Fail",size = 12,alpha=0.8)
ax.set_ylabel("Signal 9 %",size = 12,alpha=0.8)
Text(0, 0.5, 'Signal 9 %')
Observation
The median is similar for Pass and Fail for feature 9 and both distributions appear symmetric; the IQR is larger for the Fail class.
# Box Plots
f, (ax) = plt.subplots(1, 1, figsize=(12, 4))
f.suptitle('Pass/Fail based on signal 41', fontsize=14)
sns.boxplot(x="Pass/Fail", y="41", data=signal_df_im, ax=ax)
ax.set_xlabel("Pass/Fail",size = 12,alpha=0.8)
ax.set_ylabel("Signal 41 %",size = 12,alpha=0.8)
Text(0, 0.5, 'Signal 41 %')
Observation:
A few features are visualised individually for a closer look.
The median is similar for Pass and Fail for feature 41, and the distribution for the Pass class is right-skewed.
sns.pairplot(df_1 , hue="Pass/Fail" , diag_kind = 'kde')
<seaborn.axisgrid.PairGrid at 0x1bb301a3850>
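Beyond pairplots, multivariate structure is often easier to read from the correlation matrix (e.g. `sns.heatmap(signal_df_im.corr())`). A hypothetical helper to surface the most strongly correlated feature pairs, demonstrated on synthetic data:

```python
import numpy as np
import pandas as pd

def top_correlations(df, n=5):
    # Flatten the upper triangle of the absolute correlation matrix and
    # return the n most strongly correlated feature pairs
    corr = df.corr(numeric_only=True).abs()
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    return upper.stack().sort_values(ascending=False).head(n)

# Synthetic example: 'b' is a noisy copy of 'a', 'c' is independent
rng = np.random.default_rng(0)
a = rng.normal(size=200)
demo = pd.DataFrame({'a': a,
                     'b': a + rng.normal(scale=0.1, size=200),
                     'c': rng.normal(size=200)})
print(top_correlations(demo, n=1).index[0])  # ('a', 'b')
```

Highly correlated pairs are natural candidates for dropping during feature selection, which ties into the project objective of checking whether all 591 signals are really needed.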